2015 Journal article Open Access

Utility-theoretic ranking for semiautomated text classification
Berardi G., Esuli A., Sebastiani F.
Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive by validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
Source: ACM transactions on knowledge discovery from data 10 (2015): 6. doi:10.1145/2742548
DOI: 10.1145/2742548
DOI: 10.48550/arxiv.1503.00491

See at: arXiv.org e-Print Archive Open Access | ACM Transactions on Knowledge Discovery from Data | ISTI Repository | dl.acm.org Restricted | ACM Transactions on Knowledge Discovery from Data | doi.org | CNR ExploRA

2015 Conference article Open Access

Multi-store metadata-based supervised mobile App classification
Berardi G., Esuli A., Fagni T., Sebastiani F.
The mass adoption of smartphone and tablet devices has boosted the growth of the mobile applications market. Confronted with a huge number of choices, users may encounter difficulties in locating the applications that meet their needs. Sorting applications into a user-defined classification scheme would help the app discovery process. Systems for automatically classifying apps into such a classification scheme are thus sorely needed. Methods for automated app classification have been proposed that rely on tracking how the app is actually used on users' mobile devices; however, this approach can lead to privacy issues. We present a system for classifying mobile apps into user-defined classification schemes which instead leverages information publicly available from the online stores where the apps are marketed. We present experimental results obtained on a dataset of 5,993 apps manually classified under a classification scheme consisting of 50 classes. Our results indicate that automated app classification can be performed with good accuracy, at the same time preserving users' privacy.
Source: 30th Annual ACM Symposium on Applied Computing, pp. 585–588, Salamanca, ES, 13-17/04/2015
DOI: 10.1145/2695664.2695997

See at: ISTI Repository Open Access | dl.acm.org Restricted | doi.org | CNR ExploRA

2015 Conference article Open Access

Semi-automated text classification for sensitivity identification
Berardi G., Esuli A., Macdonald C., Ounis I., Sebastiani F.
Sensitive documents are those that cannot be made public, e.g., for personal or organizational privacy reasons. For instance, documents requested through Freedom of Information mechanisms must be manually reviewed for the presence of sensitive information before their actual release. Hence, tools that can assist human reviewers in spotting sensitive information are of great value to government organizations subject to Freedom of Information laws. We look at sensitivity identification in terms of semi-automated text classification (SATC), the task of ranking automatically classified documents so as to optimize the cost-effectiveness of human post-checking work. We use a recently proposed utility-theoretic approach to SATC that explicitly optimizes the chosen effectiveness function when ranking the documents by sensitivity; this is especially useful in our case, since sensitivity identification is a recall-oriented task, thus requiring the use of a recall-oriented evaluation measure such as F2. We show the validity of this approach by running experiments on a multi-label multi-class dataset of government documents manually annotated according to different types of sensitivity.
Source: 24th ACM International Conference on Information and Knowledge Management, pp. 1711–1714, Melbourne, AU, 19-23/10/2015
DOI: 10.1145/2806416.2806597

See at: Enlighten Open Access | ISTI Repository | dl.acm.org Restricted | doi.org | CNR ExploRA

2015 Conference article Restricted

Word embeddings go to Italy: A comparison of models and training datasets
Berardi G., Esuli A., Marcheggiani D.
In this paper we present some preliminary results on the generation of word embeddings for the Italian language. We compare two popular word representation models, word2vec and GloVe, and train them on two datasets with different stylistic properties. We test the generated word embeddings on a word analogy test derived from the one originally proposed for word2vec, adapted to capture some of the linguistic aspects that are specific to Italian. Results show that the tested models are able to create syntactically and semantically meaningful word embeddings despite the higher morphological complexity of Italian with respect to English. Moreover, we have found that the stylistic properties of the training dataset play a relevant role in the type of information captured by the produced vectors.
Source: 6th Italian Information Retrieval Workshop, Cagliari, 25-26/05/2015

See at: ceur-ws.org Restricted | CNR ExploRA

2015 Conference article Open Access

A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining
Jimenez Zafra S., Berardi G., Esuli A., Marcheggiani D., Martin-Valdivia M. T., Moreo Fernández A.
We present the Trip-MAML dataset, a Multi-Lingual dataset of hotel reviews that have been manually annotated at the sentence-level with Multi-Aspect sentiment labels. This dataset has been built as an extension of an existing English-only dataset, adding documents written in Italian and Spanish. We detail the dataset construction process, covering the data gathering, selection, and annotation. We present inter-annotator agreement figures and baseline experimental results, comparing the three languages. Trip-MAML is a multi-lingual dataset for aspect-oriented opinion mining that enables researchers (i) to face the problem on languages other than English and (ii) to experiment with the application of cross-lingual learning methods to the task.
Source: Conference on Empirical Methods in Natural Language Processing, pp. 2533–2538, Lisbon, 17-21/09/2015
DOI: 10.18653/v1/d15-1302

See at: ISTI Repository Open Access | www.aclweb.org | www.aclweb.org | doi.org Restricted | www.scopus.com | CNR ExploRA

2015 Conference article Open Access

Classifying Websites by industry sector: a study in feature design
Berardi G., Esuli A., Fagni T., Sebastiani F.
Classifying companies by industry sector is an important task in finance, since it allows investors and research analysts to analyse specific subsectors of local and global markets for investment monitoring and planning purposes. Traditionally this classification activity has been performed manually, by dedicated specialists carrying out in-depth analysis of a company's public profile. However, this is more and more unsuitable in today's globalised markets, in which new companies spring up, old companies cease to exist, and existing companies refocus their efforts on different sectors at an astounding pace. As a result, tools for performing this classification automatically are increasingly needed. We address the problem of classifying companies by industry sector via the automatic classification of their websites, since the latter provide rich information about the nature of the company and market segment it targets. We have built a website classification system and tested its accuracy on a dataset of more than 20,000 company websites classified according to a 2-level taxonomy of 216 leaf classes explicitly designed for market research purposes. Our experimental study provides interesting insights as to which types of features are the most useful for this classification task.
Source: 30th Annual ACM Symposium on Applied Computing, pp. 1053–1059, Salamanca, ES, 13-17/04/2015
DOI: 10.1145/2695664.2695722

See at: ISTI Repository Open Access | dl.acm.org Restricted | doi.org | CNR ExploRA

2015 Conference article Open Access

On the impact of Entity Linking in microblog real-time filtering
Berardi G., Ceccarelli D., Esuli A., Marcheggiani D.
Microblogging is a model of content sharing in which the temporal locality of posts with respect to important events, either of foreseeable or unforeseeable nature, makes applications of real-time filtering of great practical interest. We propose the use of Entity Linking (EL) in order to improve the retrieval effectiveness, by enriching the representation of microblog posts and filtering queries. EL is the process of recognizing in an unstructured text the mention of relevant entities described in a knowledge base. EL of short pieces of text is a difficult task, but it is also a scenario in which the information EL adds to the text can have a substantial impact on the retrieval process. We implement a state-of-the-art filtering method, based on the best systems from the TREC Microblog track real-time adhoc retrieval and filtering tasks, and extend it with a Wikipedia-based EL method. Results show that the use of EL significantly improves over non-EL based versions of the filtering methods.
Source: SAC'15 - 30th Annual ACM Symposium on Applied Computing, pp. 1066–1071, Salamanca, Spain, 13-17 April 2015
DOI: 10.1145/2695664.2695761
DOI: 10.48550/arxiv.1611.03350

See at: arXiv.org e-Print Archive Open Access | arxiv.org | ISTI Repository | dl.acm.org Restricted | doi.org | doi.org | CNR ExploRA

2014 Journal article Restricted

Optimising human inspection work in automated verbatim coding
Berardi G., Esuli A., Sebastiani F.
Automatic verbatim coding technology is essential in many contexts in which, either because of the sheer size of the dataset we need to code, or because of demanding time constraints, or because of cost-effectiveness issues, manual coding is not a viable option. However, in some of these contexts the accuracy standards imposed by the customer may be too high for today's automated verbatim coding technology; this means that human coders may need to devote some time to inspecting (and correcting where appropriate) the most problematic autocoded verbatims, with the goal of increasing the accuracy of the coded set. We discuss a software tool for optimising the human coders' work, i.e., a tool that minimizes the amount of human inspection required to reduce the overall error down to a desired level, or that (equivalently) maximises the reduction in the overall error achieved for an available amount of human inspection work.
Source: International journal of market research 56 (2014): 489–512. doi:10.2501/IJMR-2014-032
DOI: 10.2501/ijmr-2014-032

See at: International Journal of Market Research Restricted | www.scopus.com | www.scopus.com | CNR ExploRA

2014 Doctoral thesis Unknown

Semi-automated text classification
Berardi G.
There is currently a high demand for information systems that automatically analyze textual data, since many organizations, both private and public, need to process large amounts of such data as part of their daily routine, an activity that cannot be performed by means of human work only. One of the answers to this need is text classification (TC), the task of automatically labelling textual documents from a domain D with thematic categories from a predefined set C. Modern text classification systems have reached high efficiency standards, but cannot always guarantee the labelling accuracy that applications demand. When the level of accuracy that can be obtained is insufficient, one may revert to processes in which classification is performed via a combination of automated activity and human effort. One such process is semi-automated text classification (SATC), which we define as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected such increase is maximized. An obvious strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top-ranked. In this dissertation we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive by validating a given automatically labelled document. We also propose new effectiveness measures for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a ranked list generated by a given ranking method. 
We report the results of experiments showing that, with respect to the baseline method above, and according to the proposed measures, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error. We therefore explore the task of SATC and the potential of our methods, in multiple text classification contexts. This dissertation is, to the best of our knowledge, the first to systematically address the task of semi-automated text classification.

See at: CNR ExploRA

2013 Journal article Restricted

Endorsements and rebuttals in blog distillation
Berardi G., Esuli A., Sebastiani F., Silvestri F.
In this paper we test a new approach to blog distillation, defined as the task in which, given a user query, the system ranks the blogs in descending order of relevance to the query topic. Our approach is based on the idea of adding a link analysis phase to the standard retrieval-by-topicality phase. However, differently from other link analysis methods, we check whether a given hyperlink is a citation with a positive or a negative nature, i.e., if it expresses approval or disapproval of the hyperlinked page by the hyperlinking page. This allows us to test the hypothesis that distinguishing approval from disapproval brings about benefits in the blog distillation task. We have tested our method on the Blogs08 collection used in the last two editions (2009 and 2010) of the TREC Blog Track, a collection consisting of more than one million blogs and more than 28 million blog posts. Unfortunately, the experimental results seem to disconfirm the above hypothesis, due to the low level of connectivity of the collection which severely limits the impact of a link analysis phase (and, a fortiori, of the attempt to distinguish endorsements from rebuttals). Application contexts other than the blogosphere (such as, e.g., the domain of eBay transactions) are probably more suited to such an approach.
Source: Information sciences 249 (2013): 47. doi:10.1016/j.ins.2013.05.037
DOI: 10.1016/j.ins.2013.05.037

See at: Information Sciences Restricted | www.sciencedirect.com | CNR ExploRA

2013 Contribution to book Restricted

The Tanl tagger for named entity recognition on transcribed broadcast news at Evalita 2011
Berardi G., Attardi G., Dei Rossi S., Simi M.
The Tanl tagger is a configurable tagger based on a Maximum Entropy classifier, which uses dynamic programming to select the best sequences of tags. We applied it to the NER tagging task, customizing the set of features to use, and including features deriving from dictionaries extracted from the training corpus. The final accuracy of the tagger is further improved by applying simple heuristic rules.
Source: Evaluation of Natural Language and Speech Tools for Italian. International Workshop. Revised selected papers, edited by Bernardo Magnini, Francesco Cutugno, Mauro Falcone, Emanuele Pianta, pp. 116–125. Berlin: Springer, 2013
DOI: 10.1007/978-3-642-35828-9_13

See at: doi.org Restricted | link.springer.com | CNR ExploRA

2012 Conference article Open Access

Blog distillation via sentiment-sensitive link analysis
Berardi G., Esuli A., Sebastiani F., Silvestri F.
In this paper we approach blog distillation by adding a link analysis phase to the standard retrieval-by-topicality phase, where we also check whether a given hyperlink is a citation with a positive or a negative nature. This allows us to test the hypothesis that distinguishing approval from disapproval brings about benefits in blog distillation.
Source: Natural Language Processing and Information Systems. 17th International Conference on Applications of Natural Language to Information Systems, pp. 228–233, Groningen, The Netherlands, 26-28 June 2012
DOI: 10.1007/978-3-642-31178-9_26

See at: nmis.isti.cnr.it Open Access | doi.org Restricted | www.scopus.com | www.springerlink.com | CNR ExploRA

2012 Conference article Open Access

A utility-theoretic ranking method for semi-automated text classification
Berardi G., Esuli A., Sebastiani F.
In Semi-Automated Text Classification (SATC) an automatic classifier Phi labels a set of unlabelled documents D, following which a human annotator inspects (and corrects when appropriate) the labels attributed by Phi to a subset D' of D, with the aim of improving the overall quality of the labelling. An automated system can support this process by ranking the automatically labelled documents in a way that maximizes the expected increase in effectiveness that derives from inspecting D'. An obvious strategy is to rank D so that the documents that Phi has classified with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop a new utility-theoretic ranking method based on the notion of inspection gain, defined as the improvement in classification effectiveness that would derive by inspecting and correcting a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially inspecting a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method above, and according to the proposed measure, our ranking method can achieve substantially higher expected reductions in classification error.
Source: The 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 961–970, Portland, Oregon, USA, 12-16 August 2012
DOI: 10.1145/2348283.2348411

See at: nmis.isti.cnr.it Open Access | doi.org Restricted | CNR ExploRA

2012 Conference article Open Access

Metadata enrichment services for the Europeana digital library
Berardi G., Esuli A., Gordea S., Marcheggiani D., Sebastiani F.
We demonstrate a metadata enrichment system for the Europeana digital library. The system allows the institutions that provide Europeana with pointers to their content (in the form of metadata records, MRs) to enrich their MRs by classifying them under a classification scheme of their choice, and to extract/highlight entities of significant interest within the MRs themselves. The use of a supervised learning metaphor allows each content provider (CP) to generate classifiers and extractors tailored to the CP's specific needs, thus allowing the tool to be effectively available to the multitude (2000+) of Europeana CPs.
Source: Theory and Practice of Digital Libraries. Second International Conference, pp. 508–511, Paphos, Cyprus, 23-27 September 2012
DOI: 10.1007/978-3-642-33290-6_61

See at: nmis.isti.cnr.it Open Access | doi.org Restricted | link.springer.com | CNR ExploRA

2012 Conference article Open Access

ISTI@ TREC Microblog track 2012: real-time filtering through supervised learning
Berardi G., Esuli A., Marcheggiani D.
Our approach to the microblog filtering task is based on learning a relevance classifier from an initial training set of relevant and non-relevant tweets, generated by using a simple retrieval method. The classifier is then retrained using the (simulated) user feedback collected during the training process, in order to improve its accuracy as the filtering process goes on. In the official runs the system scored low effectiveness values, suffering a strong imbalance toward recall.
Source: TREC 2012 - 21st Text Retrieval Conference, Gaithersburg, US, 6-9 November 2012

See at: trec.nist.gov Open Access | CNR ExploRA

2011 Conference article Restricted

ISTI @ TREC Microblog Track 2011: Exploring the Use of Hashtag Segmentation and Text Quality Ranking
Berardi Giacomo, Esuli Andrea, Marcheggiani Diego, Sebastiani Fabrizio
In the first year of the TREC Microblog track, our participation has focused on building from scratch an IR system based on the Whoosh IR library. Though the design of our system (CipCipPy) is pretty standard, it includes three ad-hoc solutions for the track: (i) a dedicated indexing function for hashtags that automatically recognizes the distinct words composing a hashtag, (ii) expansion of tweets based on the title of any referred Web page, and (iii) a tweet ranking function that ranks tweets in results by their content quality, which is compared against a reference corpus of Reuters news. In this preliminary paper we describe all the components of our system, and the efficacy scored by our runs. The CipCipPy system is available under a GPL license.
Source: 20th Text Retrieval Conference, TREC 2011, Gaithersburg, US, November 15-18 2011

See at: trec.nist.gov Restricted | CNR ExploRA

2011 Conference article Restricted

Blog distillation via sentiment-sensitive link analysis
Berardi Giacomo, Esuli Andrea, Sebastiani Fabrizio, Silvestri Fabrizio
In this paper we report a new approach to blog distillation, defined as the task in which, given a user query, the system ranks the blogs in descending order of relevance to the query topic. Our approach is based on the idea of adding a link analysis phase to the standard retrieval-by-topicality phase. However, differently from other link analysis methods, we try to analyse whether a given hyperlink is a citation with a positive or a negative nature, i.e., if it expresses approval or disapproval of the linked page by the linking page. We report the results of testing our method on the Blogs08 collection used in the 2008 and 2009 editions of the TREC Blog Track.
Source: 2nd Italian Information Retrieval Workshop, IIR 2011, pp. 6–12, Milano, IT, 27-28 January 2011

See at: ceur-ws.org Restricted | CNR ExploRA
